116 research outputs found
Multilingual Sentence Categorization according to Language
In this paper, we describe an approach to sentence categorization whose
originality is that it relies on natural properties of languages, with no
dependency on a training set. The implementation is fast, small, robust and
tolerant of textual errors. Tested on discriminating French, English, Spanish
and German, the system achieved 99.4% correct assignments on real sentences
in one test.
The resolution power is based on grammatical words (not the most common
words) and the alphabet. Having the grammatical words and the alphabet of each
language at its disposal, the system computes for each language its likelihood
of being selected. The name of the language with the highest likelihood tags
the sentence, but unresolved ambiguities are maintained. We
discuss the reasons which led us to use these linguistic facts and present
several directions for improving the system's classification performance.
Categorizing sentences with linguistic properties shows that difficult
problems sometimes have simple solutions.
Comment: 4 pages --- LaTe
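The scoring idea in this abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the tiny word lists, the sets of language-specific letters, the simple additive score and the function names `score_languages` and `categorize` are not taken from the paper, which does not give its lexicons or its likelihood formula.

```python
import re

# Tiny illustrative lexicons (assumptions, not the paper's actual word lists).
GRAMMATICAL_WORDS = {
    "french": {"le", "la", "les", "des", "est", "et", "un", "une", "dans", "que"},
    "english": {"the", "of", "and", "is", "a", "in", "that", "to", "it", "with"},
    "spanish": {"el", "la", "los", "las", "es", "y", "un", "una", "en", "que"},
    "german": {"der", "die", "das", "und", "ist", "ein", "eine", "in", "nicht", "mit"},
}

# Letters treated as typical of each alphabet (also an assumption).
SPECIFIC_LETTERS = {
    "french": set("àâçèêëîïôùûœ"),
    "english": set(),
    "spanish": set("áíñóú¿¡"),
    "german": set("äöß"),
}


def score_languages(sentence):
    """Return a score per language: grammatical-word hits plus alphabet hits."""
    lowered = sentence.lower()
    words = re.findall(r"\w+", lowered)
    letters = set(lowered)
    scores = {}
    for lang, lexicon in GRAMMATICAL_WORDS.items():
        word_hits = sum(1 for w in words if w in lexicon)
        alphabet_hits = len(letters & SPECIFIC_LETTERS[lang])
        scores[lang] = word_hits + alphabet_hits
    return scores


def categorize(sentence):
    """Return every language tied for the best score, so ambiguities are kept."""
    scores = score_languages(sentence)
    best = max(scores.values())
    return sorted(lang for lang, s in scores.items() if s == best)
```

Returning a list rather than a single label is one way to keep unresolved ambiguities: a sentence such as "la" matches both the French and the Spanish lexicon here and is tagged with both.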
Daniel@FinTOC-2019 Shared Task : TOC Extraction and Title Detection
We present different methods for the two tasks of the 2019 FinTOC challenge: Title Detection and Table of Contents (TOC) Extraction. For the Title Detection task, we present different approaches using various features: visual characteristics, punctuation density and character n-grams. Our best approach achieved an official F-measure of 94.88%, ranking sixth on this task. For the TOC Extraction task, we present a method combining visual characteristics of the document layout. With this method, we ranked first on this task with a score of 42.72%.
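Two of the textual features named in this abstract, punctuation density and character n-grams, can be sketched as follows. The exact definitions used by the authors are not given, so the formulas below (share of punctuation characters in the string, counts of overlapping character substrings) and the function names are assumptions; visual characteristics are left out because they require layout information.

```python
import string
from collections import Counter


def punctuation_density(text):
    """Share of characters in `text` that are ASCII punctuation marks."""
    if not text:
        return 0.0
    return sum(ch in string.punctuation for ch in text) / len(text)


def char_ngrams(text, n=3):
    """Count every overlapping substring of length `n` in `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```

A candidate line such as "1. Introduction" could then be described by `punctuation_density(line)` together with its n-gram counts before being passed to whatever classifier is used.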
Web Page Segmentation for Non Visual Skimming
Web page segmentation aims to break a page into smaller blocks, in which contents with coherent semantics are kept together. Examples of tasks targeted by such a technique are advertisement detection and main content extraction. In this paper, we study different segmentation strategies for the task of non-visual skimming. For that purpose, we consider web page segmentation as a clustering problem of visual elements, where (1) all visual elements must be clustered, (2) a fixed number of clusters must be discovered, and (3) the elements of a cluster should be visually connected. We therefore study three different algorithms that comply with these constraints: K-means, F-K-means, and Guided Expansion. Evaluation shows that Guided Expansion yields statistically relevant results in terms of compactness and separateness, and satisfies more logical constraints than the other strategies.
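To make the clustering formulation concrete, here is a minimal sketch of plain K-means applied to visual elements. It assumes each element is reduced to the (x, y) centre of its bounding box, which is a simplification not stated in the abstract, and the deterministic initialisation is chosen only to keep the example reproducible. It covers constraints (1) and (2); constraint (3), visual connectivity, is not enforced by this sketch.

```python
def kmeans(points, k, iterations=20):
    """Assign every 2D point to one of k clusters; returns one label per point."""
    # Deterministic initialisation: pick k points spread over the input order.
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels
```

For example, `kmeans([(0, 0), (1, 0), (0, 1), (100, 100), (101, 100), (100, 101)], k=2)` groups the three elements near the origin apart from the three elements near (100, 100).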
Concurrent Speech Synthesis to Improve Document First Glance for the Blind
Skimming and scanning are two well-known reading processes, which are combined to access document content as quickly and efficiently as possible. While both are available in visual reading mode, they are rather difficult to use in non-visual environments because they mainly rely on typographical and layout properties. In this article, we introduce the concept of tag thunder as a way (1) to achieve the oral transposition of the web 2.0 concept of tag cloud and (2) to produce an innovative interactive stimulus for observing the emergence of self-adapted strategies for non-visual skimming of written texts. We first present our general and theoretical approach to the problem of fast, global and non-visual access to web content; we then detail the progress of the development and evaluation of the various components that make up our software architecture. We start from the hypothesis that the semantics of the visual architecture of web pages can be transposed into new sensory modalities through three main steps (web page segmentation, keyword extraction and sound spatialization). We note the difficulty of simultaneously (1) evaluating a modular system as a whole at the end of the processing chain and (2) identifying, at the level of each software component, the exact origin of its limitations; despite this issue, the results of the first evaluation campaign seem promising.
The Stakes of multilinguality: Multilingual text tokenization in Natural Language Diagnosis
L'analyse automatique de forums de discussion dans un contexte pédagogique
Méthode pour l'analyse automatique de structures formelles sur documents multilingues
This thesis deals with the automatic parsing of formal structures in written texts. It begins with a presentation of documents in their multilingual dimension and of the necessity of processing them as such. We study their multilingual structure and present how to compute it with the help of a language identification tool. Then, we present an original method for the syntactic parsing of unrestricted French sentences. This method is a generalization and an abstraction of Jacques Vergne's research. The syntactic structures we are interested in are the minimal syntagm and the proposition; both units can be given a definition that is valid across languages, so that the method can be applied to various languages. We propose two processes which allow the building of these units. Both processes consider texts as flows and build syntactic structures through the propagation of relational constraints. As the syntagmatic and propositional structures are dependent, they are built up by the interaction of the two processes, the second process being able to work on partially defined units. We show that both processes are identical if we disregard the nature of the unit they build up and the rule base they use. The main thread of this thesis is the method. Each time a structure is computed, we emphasize the method used to obtain it. We show that this method is unique. Each structure is computed with the help of formal and positional clues: these clues come from the study of the units located inside the structure (internal clues) or from the study of the function of the structure in its upper-level unit (external clues).
Multilingual Sentence Categorization according to Language
Sentence categorization according to language is a fundamental issue for NLP, especially in document processing. With the growing amount of multilingual text corpus data becoming available, sentence categorization, which leads to multilingual text structure, opens a wide range of applications in multilingual text analysis, such as information retrieval or preprocessing for a multilingual syntactic parser